DOCUMENT RESUME



ED 17* 6*7 



TB 009 n95 



AUTHOR          Mullins, Cecil J.; And Others
TITLE           Personnel Rating Effectiveness as a Function of
                Number of Rating Statements. Final Report for Period
                24 February 1978 - 31 December 1978.
INSTITUTION     Air Force Human Resources Lab., Brooks AFB, Texas.
REPORT NO       AFHRL-TR-79-11
PUB DATE        May 79
AVAILABLE FROM  National Technical Information Service, Springfield,
                Virginia 22161
EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     Cognitive Ability; Communication Skills; Evaluation
                Criteria; Job Skills; *Leadership Qualities; *Officer
                Personnel; Peer Evaluation; Personality Assessment;
                *Personnel Evaluation; *Predictive Validity; Profile
                Evaluation; *Rating Scales; Test Construction
IDENTIFIERS     Air Force; *Test Length



ABSTRACT

Research on the comparative utility of varying
numbers of rating statements per set, using an external criterion of
recognition of rater profiles for evaluating "goodness" of the
sets, was conducted on 132 noncommissioned Air Force officers. Groups
of subjects rated groups of their peers on 20 factors, on a subset of
10 factors, and on a subset of five factors. Rating statements
included: learning ability; leadership; quality of work; motivation;
ability to follow instructions; bearing and behavior; accuracy; oral
communication; problem analysis; initiative; quality of work; written
communication; punctuality; adaptability; dependability; emotional
stability; human relations; judgment; knowledge of duties; and
honesty. Profiles, based on group ratings, were developed for each
subject and given to the group members to identify. Profiles made
from 20 rating statements were identified no better than profiles
developed from 5 rating statements. Also, the use of two rating
statements, learning ability and knowledge of duties, was found to
produce information about a ratee which could not be improved by the
addition of many more rating factors. When the external criterion was
used, results indicated that sets of statements larger than five did
not provide better recognition of peers. (MH)



* Reproductions supplied by EDRS are the best that can be made
* from the original document.






AFHRL-TR-79-11 



AIR FORCE HUMAN RESOURCES



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



PERSONNEL RATING EFFECTIVENESS 
AS A FUNCTION OF NUMBER 
OF RATING STATEMENTS 



By 



Cecil J. Mullins 
James A. Earles 
James M. Wilbourn 



PERSONNEL RESEARCH DIVISION 
Brooks Air Force Base, Texas 78235 



May 1979 

Final Report for Period 24 February 1978 — 31 December 1978 



Approved for public release; distribution unlimited.



LABORATORY 






AIR FORCE SYSTEMS COMMAND 

BROOKS AIR FORCE BASE, TEXAS 78235




NOTICE 



When U.S. Government drawings, specifications, or other data are used
for any purpose other than a definitely related Government
procurement operation, the Government thereby incurs no
responsibility nor any obligation whatsoever, and the fact that the
Government may have formulated, furnished, or in any way supplied
the said drawings, specifications, or other data is not to be regarded by
implication or otherwise, as in any manner licensing the holder or any
other person or corporation, or conveying any rights or permission to
manufacture, use, or sell any patented invention that may in any way
be related thereto.

This final report was submitted by Personnel Research Division, under 
project 2313, Force Acquisition, Assignment, and Evaluation, with HQ
Air Force Human Resources Laboratory (AFSC), Brooks Air Force 
Base, Texas 78235. Dr. Cecil J. Mullins (PEP) was the Principal 
Investigator for the Laboratory. 

This report has been reviewed by the Information Office (OI) and is
releasable to the National Technical Information Service (NTIS). At 
NTIS, it will be available to the general public, including foreign 
nations. 

This technical report has been reviewed and is approved for publication.

LELAND D. BROKAW, Technical Director 
Personnel Research Division 



RONALD W. TERRY, Colonel, USAF 
Commander 



Unclassified 



SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)



REPORT DOCUMENTATION PAGE 


READ INSTRUCTIONS 
BEFORE COMPLETING FORM 


1. REPORT NUMBER 

AFHRL-TR-79-11


2. GOVT ACCESSION NO. 


3. RECIPIENT'S CATALOG NUMBER 


4. TITLE (and Subtitle)

PERSONNEL RATING EFFECTIVENESS AS A FUNCTION 
OF NUMBER OF RATING STATEMENTS 


5. TYPE OF REPORT & PERIOD COVERED

Final

24 February 1978 - 31 December 1978






6. PERFORMING ORG. REPORT NUMBER 


7. AUTHOR(s)

Cecil J. Mullins
James A. Earles
James M. Wilbourn




9. PERFORMING ORGANIZATION NAME AND ADDRESS 

Personnel Research Division 

Air Force Human Resources Laboratory 

Brooks Air Force Base, Texas 78235


10. PROGRAM ELEMENT, PROJECT, TASK
AREA & WORK UNIT NUMBERS

61102F 
2313T616 


11. CONTROLLING OFFICE NAME AND ADDRESS 

HQ Air Force Human Resources Laboratory (AFSC) 


12. REPORT DATE 

May 1979


Brooks Air Force Base, Texas 




13. NUMBER OF PAGES 

18 


14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)


15. SECURITY CLASS. (of this report)

Unclassified 






15a. DECLASSIFICATION/DOWNGRADING
SCHEDULE


16. DISTRIBUTION STATEMENT (of this Report)






Approved for public release; distribution unlimited 






17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)


18. SUPPLEMENTARY NOTES 


19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

personnel ratings 
rating dimensions 
rating factors 
rating multiple factors
rating statements 


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

Previous work on sets of personnel rating statements leaves unanswered the question of whether there is any
advantage in using several "factor" rating statements over the use of a single statement. This is a study of the
comparative utility of sets of rating statements varying in number of statements per set, using an external criterion.

A great deal of research effort has been expended in an effort to find "best" factors for collecting rating data.
Most of this research has concentrated on internal psychometric characteristics of the rating data, such as means,
standard deviations, and reliability coefficients. When internal psychometric considerations constitute the sole
criterion, some small advantage is frequently found for one kind of rating statements over another. When external
criteria for evaluating "goodness" of rating sets are applied, there are usually no differences found among sets.


DD FORM 1473 (1 JAN 73)   EDITION OF 1 NOV 65 IS OBSOLETE





Item 20 Continued: 



The subjects for this study were 132 students at the NCO academy at Lackland AFB. Three treatment
groups were used. The subjects in one group of 45 were required to rate their peers in their 15-man study groups on
20 rating factors. A second group rated their peers on 10 factors, which were a subset of the 20 factors used by
the first group. The third group of subjects rated their peers on a subset of only five rating factors. From the ratings,
profiles were developed for each subject indicating how that individual had been rated by a peer group. The
profiles, with no identifying information on them, were handed out to the group members, who were required to
identify them. A record was kept of all correct identifications. Analyses of variance were done to see if there were
any significant differences among the three groups. In addition, correlation coefficients were computed between
various sets of rating statements and a criterion of class standing upon graduation, to see whether 20 statements
predicted this criterion better than 10 and whether 10 statements were more predictive than five.

The analysis of variance portion of the study produced no significant differences among the groups. The
multiple regression analysis indicated that [illegible in source] the addition
of the other 18 to the predictor pool generated no useful additional prediction. The results of this study indicate
that a very small set of rating factors (e.g., two in this study) produce information about a ratee which cannot be
improved by the addition of many more factor statements.






TABLE OF CONTENTS 

I. Introduction

II. Method

Rating Rationale

Subjects

Procedures

III. Results and Discussion

References

Appendix A: Evaluation Form

LIST OF TABLES

Table

1 Number of Profile Identifications (Hits) by Treatment and by Seminar Group

2 Analysis of Variance of Number of Correct Profile Identifications (Hits) by Treatment and by Seminar Group

3 Analysis of Variance of Squared Deviations between Unidentified Profile Rankings and Peer Rankings by Treatment and by Seminar Group

4 Intercorrelations, 20 Rating Statements, and Final School Grade

5 Intercorrelations, 10 Rating Statements, and Final School Grade

6 Intercorrelations, Five Rating Statements, and Final School Grades

7 Regression Analyses of Varying Numbers of Rating Statements



PERSONNEL RATING EFFECTIVENESS AS A FUNCTION 
OF NUMBER OF RATING STATEMENTS 



I. INTRODUCTION 

The literature contains a large number of studies issuing from the search for appropriate
rating constructs to be used in the collection of rated data. The results have been rather
disappointing, but still the search goes on. The pursuit of rating constructs (or "factors") is
probably due to the enormous influence of Thurstone's work with the factor analysis of test data
and to his conclusion that complex human characteristics can best be explained in terms of a few
orthogonal factors, that is, factors which are not correlated with each other. American
psychologists, in general, have accepted Thurstone's position. Those who have worked with rating
data have started from the assumption that the concept of orthogonality is almost a natural law.
If one accepts that assumption, it is reasonable that one of the primary goals of rating research
has been to find that set of independent (orthogonal) constructs which best describes human
behavior when rating data are used. It is, after all, merely an extension into rating data of a
principle which has been accepted broadly as a fundamental concept in test data.

There are, however, at least four major difficulties that have beset researchers in their quest 
for simple structure in rating data. These difficulties are as follows: 

1. Orthogonality as a Concept. Although the concept of orthogonality as a requisite for
factors has been persuasive to American psychologists, not all prominent modern psychologists have
succumbed to the attractiveness of Thurstone's arguments for the primacy of specific or orthogonal
factors to describe human abilities (e.g., Horn, 1968; Humphreys, 1962; Jensen, 1966; McNemar,
1964, to name only a few). Indeed, McNemar (1964) has pointed out a serious weakness in the
entire factor analytic process:

In practically all areas of psychological research the demonstration of trivially
small minutiae is doomed to failure because of random errors. Not so if your
technique is factor analysis, despite its being based on the correlation
coefficient, that slipperiest of all statistical measures. By some magic, hypotheses
are tested without significance tests. This happy situation permits me to
announce a Principle of Psychological Regress: Use statistical techniques that lack
inferential power. This will not inhibit your power of subjective inference.

In the same article (a discussion of the concept of intelligence), McNemar finds no advantage
of fractionating general mental ability into differentially weighted independent separate factors, even
in predicting meaningful criteria. The problem of finding separate rating "factors" is quite
analogous. We have no convincing evidence that separate rating statements will provide data that
are more useful than one global rating of all-around excellence. There may not be any set of
rating "factors" in the simple structure sense.

2. Theory Weakness. If rating "factors" exist, it is not at all clear in what direction they
may lie. There is no widely accepted theory which provides clues to the researcher to aid him in
his search. Without such clues, the number of descriptive qualities, interacting with ways of
expressing those qualities, is literally almost endless. This is one reason so much effort has been
expended in the search for the best rating statements. In test theory, it is known that certain
factors (e.g., verbal, numerical) are stable and replicable, although some have questioned the utility
of large factor sets. In rating theory, we do not even know the best format for collecting data,
much less which constructs are more likely to yield useful information.






It is not even clear how rating questions should be worded. For example, there has been
considerable controversy over whether statements oriented around tasks performed ("adjusts the
[illegible] on the clutch pedal") or statements oriented around personal characteristics of the ratee
("forceful and dominant in interpersonal relations") are more useful in describing ratees (Kavanagh,
197[?]; Massey, Mullins, & Earles, 1978). Generally, on this issue, task-oriented statements appear to
be slightly better by internal psychometric standards (slightly less inflated means, slightly larger
standard deviations, larger reliability coefficients), but no differences are usually found when
evaluation of the rating statements is made by applying an external criterion. This problem of
theory weakness is much more severe in a search of rating data for rating factors than it is in a
search of test score data for intellectual factors, because the universe of discourse is so much
larger and harder to define.

3. Differential Description. Even if reasonably good factors could be deduced from some
theory and even if they really were present in a given rating situation, there is no assurance that
they could be demonstrated from the rating data collected. All psychologists are aware of the
halo phenomenon in ratings, and the halo effect is possibly strong enough that the average rater
simply cannot produce differentiation among ratee characteristics sharp enough and objective
enough that the factors would show up in the data analysis. But the first burden of a set of
rating factors, if they are really worthwhile, must be to describe differentially the members of a
ratee group. If a set of rating statements does not paint a unique picture of each ratee with
considerable differences between his picture and that of each other member, it is difficult to see
how that set of rating statements could produce useful validities against any reasonable outside
criterion.

4. Criterion Problems. Ratings are usually collected to serve as a criterion, rather than a
predictor variable. One of the reasons rating data are used is that the investigator can find no
other way of measuring the variable of interest. Therefore, the rating is the most "ultimate" score
one can collect. There is no available metric closer to the true score than the ratings themselves.
If one accepts the position that the rating score is the ultimate criterion, then of course one
cannot question its validity. It is by definition perfectly valid. In such a case, one can only
investigate certain internal psychometric characteristics, such as its reliability (Remmers, 1934, p.
621). In some situations, particularly in the operational use of ratings, this can sometimes be a
reasonable position.

In doing research on rating methodology, however, it seems essential to have some other
criterion available which allows one to compare the "goodness" of one rating or set of ratings
with another. It is of little use to compare reliabilities, means, standard deviations, and other
internal psychometric characteristics if one is trying to determine which set of ratings is better at
measuring a particular condition.

Previous studies in this series of investigations (Curton, Ratliff, & Mullins, 1979; Massey,
Mullins, & Earles, 1978) have attempted to discover qualities of "good" rating statements
compared with "poor" rating statements. The methodology has been what one might expect from
the difficulties labeled 3 and 4, above, discussing differential description and criterion problems,
respectively. Different sets of rating statements were compared by observing their relative merit in
differentially describing ratees, and in their prediction of external criteria. None of the sets of
rating statements investigated so far have shown any superiority over any of the others.

Judging from the results available so far in the literature, it may well be that perhaps all
that the average rater can do effectively is rate on some general overall idea of excellence and
that requiring the rater to rate separate characteristics independently is beyond a person's
capability. If this is so, it is another way of saying that halo error overwhelms the variance in
sets of rating statements and that sets of rating statements are not more efficient than a single
rating. This is a study of the relative effectiveness of requiring raters to rate varying numbers of
statements, with effectiveness defined by criteria external to the ratings.






II. METHOD



Rating Rationale 

The two studies cited previously were based on the premise that if rating statements are 
actually meaningful, raters should be able to identify unlabelled profiles made by their peers from
those rating statements. The two previous studies indicated that correct identifications (hits) are 
too few to provide a reasonable degree of sensitivity. The average number of hits per rater, rating 
14 peers, is only about 2.5. Therefore, a refined method of hits was also used, called the rank 
order (RO) method. 

The RO method provides a way to credit near misses. The hits approach is all or nothing. 
The rating subject either guesses the profile correctly or does not. The rater may be sure that the 
profile being studied is either, say, peer B or peer F, so commits to B. If the profile really
belongs to peer F, the rater not only gets no credit for being close on the identification of peer
F's profile, but also misses peer B's profile as well. The RO method, though a little difficult to
understand, is an approach designed to make the hits method more sensitive. 

If, in addition to being asked for absolute hits, the rater is asked to rank the unidentified 
profiles in terms of some standard of excellence (e.g., how well the people with these profiles will
do in a course of instruction), and if the rater is also given a list of the names of peers and is
asked to rank these peers on the same standard of excellence, and then the rank differences are
analyzed, this is in a real sense a measure of hits which gives credit for near misses. For example, 
if the rater believes correctly that the three profiles which appear to be the "best" in terms of 
most likely to succeed belong to peers B, C, and F, but is not sure which is which, the ranking 
approach will provide credit for placing these peers in the proper end of the ranking, whereas 
absolute hits might give the rater no credit at all. 
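The contrast between the all-or-nothing hits count and the rank-order (RO) idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not spell out its exact scoring formula, so the sketch assumes the RO score for one rater is the sum of squared differences between the rank given to each unlabelled profile and the rank the same rater gave to the peer who actually owns it. The function names and the four-profile data set are hypothetical.

```python
def count_hits(guesses, owners):
    """Count profiles that the rater matched to the correct peer."""
    return sum(1 for profile, peer in guesses.items() if owners[profile] == peer)


def squared_rank_deviation(profile_ranks, peer_ranks, owners):
    """Sum of squared differences between the rank given to each profile
    and the rank the same rater gave to the peer who owns that profile
    (assumed scoring rule; lower means closer agreement)."""
    return sum(
        (profile_ranks[profile] - peer_ranks[owner]) ** 2
        for profile, owner in owners.items()
    )


# Hypothetical example: four unlabelled profiles and their true owners.
owners = {"P1": "B", "P2": "C", "P3": "F", "P4": "D"}

# The rater confuses the three strongest profiles with one another.
guesses = {"P1": "C", "P2": "F", "P3": "B", "P4": "D"}

# Ranks the rater gave to the profiles and, separately, to the named peers.
profile_ranks = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
peer_ranks = {"F": 1, "B": 2, "C": 3, "D": 4}

hits = count_hits(guesses, owners)
ro_score = squared_rank_deviation(profile_ranks, peer_ranks, owners)
print(hits, ro_score)
```

In this made-up case the rater earns a single hit, yet the squared rank deviation stays small because the three confused peers all sit at the top of both rankings. That is the sense in which the ranking approach credits near misses.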

Another external criterion for judging the relative efficacy of various sets of rating statements 
can be some typical success criterion, such as the final grade upon graduation from school. This 
criterion is not quite as direct as peer identification for judging the quality of peer ratings,
because an additional element, validity of chosen statements, becomes a consideration. Using this
criterion, not only must each rating statement contribute to a sound differential description of the
ratee, but also it must be a statement which happens to be valid for that success criterion in
order for differences among rating statements to appear. 

For example, one might calculate the correlation coefficient between the criterion and a set 
of five rating statements and then calculate another correlation coefficient between the criterion 
and 10 rating statements, of which five were the same statements used in calculating the first 
correlation coefficient. By applying the proper statistics, one can then determine whether or not 
the set of 10 statements predicted better than the subset of five statements. Whether they do is 
determined not only by the quality of the ratings as descriptors of the ratees but also by the 
validity of the quality described in the rating statement for the chosen criterion. One should be 
able to rate one's peers rather accurately on, say, height, but that would probably not be a valid
predictor of academic ability. Therefore, it is believed that the peer identification process is
probably the most direct method of judging the relative accuracy of two sets of rating statements. 
Both approaches were used in this study. 

Subjects 

Nine seminar groups of Air Force non-commissioned officers (NCOs) (technical and master 
sergeants) assigned to the Air Training Command (ATC) NCO Academy at Lackland AFB Annex 
served as subjects for this study. Seven of the seminar groups were composed of 15 subjects each, 
one of 14 subjects, and one of 13 subjects, yielding a total N of 132. Length of military service 
for these subjects was 10 to 17 years. 






Procedures 

The nine seminar groups were randomly assigned to three treatment conditions of three
seminar groups each. The subjects in Treatment Condition 1 were asked to rate their peers on five
rating statements. Those in Treatment Condition 2 were asked to rate their peers on 10 rating
statements, five of which were used by Treatment Condition 1. In Treatment Condition 3 the
subjects rated their peers on 20 rating statements, including the 10 used by Treatment Condition
2. All 20 of the rating statements are given in the appendix. It should be noted that some of the
statements are very general and person oriented in nature (e.g., 1, 2, 10) while others are more
specific and job related (e.g., 6, 15, 19). This design provided 45 subjects on whom 20 rated
statements were available, 89 on whom there were 10 rated statements, and 132 who had all
rated the same five statements. Summary statistics for all nine seminar groups and three treatment
conditions are available in Table 1.

Table 1. Number of Profile Identifications (Hits)
by Treatment and by Seminar Group

                                 N    Total Hits   Mean Hits   SD Hits

Treatment 1 (5 Statements)
  Seminar group                 15        31          2.07       1.65
  Seminar group                 15        36          2.40        .95
  Seminar group                 13        35          2.69       1.49
  Treatment total               43       102          2.37       1.42

Treatment 2 (10 Statements)
  Seminar group                 14        37          2.64       1.55
  Seminar group                 15        45          3.00       1.90
  Seminar group                 15        40          2.67       1.24
  Treatment total               44       122          2.77       1.07

Treatment 3 (20 Statements)
  Seminar group                 15        36          2.40       1.54
  Seminar group                 15        40          2.67       1.92
  Seminar group                 15        34          2.27       1.33
  Treatment total               45       110          2.44       1.63

t Ratios

Treatment 1 versus 2    t = [illegible] (a)
Treatment 1 versus 3    t = [illegible] (a)
Treatment 2 versus 3    t = [illegible] (a)

(a) Not significant.
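The treatment totals in Table 1, and the sample sizes quoted for the nested design (45, 89, and 132), follow directly from the seminar-group rows. The short Python check below is only a sketch: it uses the group sizes and hit totals as they appear in Table 1, and the variable and function names are illustrative.

```python
# (group size, total hits) for each seminar group, in the order listed in Table 1.
table1 = {
    1: [(15, 31), (15, 36), (13, 35)],  # Treatment 1, 5 statements
    2: [(14, 37), (15, 45), (15, 40)],  # Treatment 2, 10 statements
    3: [(15, 36), (15, 40), (15, 34)],  # Treatment 3, 20 statements
}


def treatment_summary(groups):
    """Return (total N, total hits, mean hits per rater) for one treatment."""
    n = sum(size for size, _ in groups)
    hits = sum(h for _, h in groups)
    return n, hits, round(hits / n, 2)


summaries = {t: treatment_summary(g) for t, g in table1.items()}

# Because the statement sets are nested, every subject was rated on the five
# common statements, and Treatments 2 and 3 share the ten-statement subset.
n_20 = summaries[3][0]
n_10 = n_20 + summaries[2][0]
n_5 = n_10 + summaries[1][0]

for t, (n, hits, mean) in summaries.items():
    print(f"Treatment {t}: N={n}, hits={hits}, mean={mean}")
print(n_20, n_10, n_5)
```

The second treatment total works out to 122 hits, which agrees with the printed mean of 2.77 for 44 raters.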



After the ratings had all been collected, rating scores were averaged across raters, and profiles
were constructed, one for each ratee (see Appendix A for examples). The ratee's name was left
off the profile, but the profile itself was reproduced in sufficient copies that all members of the
group could have all the profiles of all their seminar group peers, unidentified as to name. In a
second visit to the seminar groups, the subjects were given three more tasks to perform, in the
following order. The first task was to study each of the profiles and rank the profiles according
to how "a person" with that profile should do in the class the subjects were taking. The second
task was to match each of their peers with one of the profiles (that is, indicate to whom each
profile belonged). Third, each subject was given a list of the people in the seminar group and was
told to rank them according to how well they would do in the course.

The data were subjected to two analysis of variance treatments (Tables 2 and 3), and then 
they were reanalyzed using multiple linear regression analysis (Tables 4 to 7). 






III. RESULTS AND DISCUSSION



Table 2 shows that there were no differences among treatment conditions in the number of
"hits." Those subjects who were looking at profiles made from 20 rating statements could identify
their peers no better than those looking at profiles made from five statements. Number of hits, as 
mentioned above, is a relatively insensitive measure of peer recognition, however (Table 1 shows 
that each group averaged about 2.5 hits). 



Table 2. Analysis of Variance of Number of Correct
Profile Identifications (Hits) by Treatment
and by Seminar Group

                                    Sum of               Mean
Source                              Squares      DF      Square       F

Treatment                             3.739       2       1.870     2.172 (a)
Seminar Groups Within Treatment       5.168       6        .861      .340 (a)
Error (Within Groups)               311.717     123       2.534

(a) Not significant.



Table 3 shows the results of analysis of variance treatment of the squared differences 
between the ranking of the unidentified profiles and the ranking of peers. Again, there is no 
significant difference among groups, even using this more sensitive measure of peer identification. 



Table 3. Analysis of Variance of Squared Deviations between Unidentified
Profile Rankings and Peer Rankings by Treatment and by Seminar Group

                                    Sum of                   Mean
Source                              Squares          DF      Square          F

Treatment                           15080.951         2      7540.475      .129 (a)
Seminar Groups Within Treatments   349887.789         6     58314.631     2.074 (a)
Error (Within Group)              3458429.590       123     28117.313

(a) Not significant.
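The F ratios in Tables 2 and 3 can be recomputed from the printed sums of squares and degrees of freedom. The Python sketch below assumes a nested design in which treatment is tested against seminar groups within treatment, and seminar groups are tested against the within-group error; the report does not state this explicitly, but the printed mean squares seem to imply it. The function name is illustrative.

```python
def nested_f_ratios(ss_treat, df_treat, ss_groups, df_groups, ss_error, df_error):
    """Mean squares and F ratios for groups nested within treatments.

    Assumes treatment is tested against groups-within-treatment and
    groups-within-treatment against the within-group error.
    """
    ms_treat = ss_treat / df_treat
    ms_groups = ss_groups / df_groups
    ms_error = ss_error / df_error
    return {
        "ms": (ms_treat, ms_groups, ms_error),
        "f_treatment": ms_treat / ms_groups,
        "f_groups": ms_groups / ms_error,
    }


# Sums of squares and degrees of freedom as printed in Tables 2 and 3.
table2 = nested_f_ratios(3.739, 2, 5.168, 6, 311.717, 123)
table3 = nested_f_ratios(15080.951, 2, 349887.789, 6, 3458429.590, 123)

for name, result in (("Table 2", table2), ("Table 3", table3)):
    print(name, round(result["f_treatment"], 3), round(result["f_groups"], 3))
```

Under that assumption the Table 2 ratios come out near 2.17 and .34, and the Table 3 mean squares give a treatment ratio near .129 and a seminar-group ratio near 2.07. The error degrees of freedom, 123, equal the 132 subjects minus the nine seminar groups.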



Intercorrelation matrices among the various groups of rating statements and final school grade 
appear in Tables 4 to 6. The most striking aspect of these correlations is their size. The 
reliabilities of the separate rating statements are unknown, but it appears that the intercorrelations 
of each rating statement with the others must approach the statement reliabilities. For example, in 
Table 4, 143 of the 190 intercorrelations among rating statements are .70 or higher, and 30% are 
.80 or higher. All this argues rather strongly for the likelihood that little is being rated except a
general idea of excellence. 
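The count of 190 intercorrelations is simply the number of distinct pairs among 20 statements, and the share of large coefficients can be tallied mechanically. The Python sketch below is illustrative only: it applies the tally to the five-statement coefficients of Table 6, which are small enough to type in, and the function name is an assumption of this example rather than anything used in the report.

```python
from itertools import combinations
from math import comb


def count_at_or_above(corr, threshold):
    """Return (number of off-diagonal pairs at or above threshold, number of pairs)."""
    pairs = list(combinations(range(len(corr)), 2))
    high = sum(1 for i, j in pairs if corr[i][j] >= threshold)
    return high, len(pairs)


# 20 statements give 20 * 19 / 2 distinct pairs.
pairs_20 = comb(20, 2)
share_70 = 143 / pairs_20  # proportion reported as .70 or higher in Table 4

# Intercorrelations among the five common statements, as read from Table 6.
table6 = [
    [1.00, 0.71, 0.86, 0.64, 0.66],
    [0.71, 1.00, 0.82, 0.80, 0.79],
    [0.86, 0.82, 1.00, 0.80, 0.80],
    [0.64, 0.80, 0.80, 1.00, 0.74],
    [0.66, 0.79, 0.80, 0.74, 1.00],
]

high_70, n_pairs = count_at_or_above(table6, 0.70)
high_80, _ = count_at_or_above(table6, 0.80)
print(pairs_20, round(share_70, 3), high_70, high_80, n_pairs)
```

So roughly three quarters of the Table 4 coefficients are .70 or higher, and the five-statement matrix shows the same pattern, with 8 of its 10 coefficients at or above .70.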

The one worrisome feature of these intercorrelation matrices is the fact that there are some
sizable differences among the 20 rating statements in their validities against final school grade,
with Statement 1 exhibiting the highest relationship with the criterion in each of the three
matrices. This finding seems to argue against the proposition that the rater can evaluate only in
general terms; otherwise, how could one statement be more valid than another? However, there are
two possible explanations for the higher validity of Statement 1.






Table 4. Intercorrelations, 20 Rating Statements, and Final School Grade

Note: -- marks an entry that is illegible in the source copy.

Rating Statements 1 to 13

       1     2     3     4     5     6     7     8     9    10    11    12    13
 1   1.00   --   .85   .57   .67   .63   .81   .67   .76   .75   .78   .82   .57
 2         1.00  .85   .77   .73   .75   .75   .78   .87   .84   .80   .79   .63
 3               1.00  .75   .82   .76    --   .71   .81   .85   .85   .79   .72
 4                     1.00  .67   .84   .75   .64   .70   .84   .74   .59   .65
 5                           1.00  .74   .85   .61   .84   .75   .82   .67   .74
 6                                 1.00  .75   .66   .77   .83   .72   .61   .65
 7                                       1.00  .76   .83   .87   .91   .76   .64
 8                                             1.00  .81   .80   .80   .81   .61
 9                                                   1.00  .82   .85   .77   .73
10                                                         1.00  .88   .76   .67
11                                                               1.00   --   .70
12                                                                     1.00  .57
13                                                                           1.00

Rating Statements 14 to 20 and Final School Grade (FSG)

        14    15    16    17    18    19    20     FSG
 1     .62   .70   .70   .48   .67   .73   .63     --
 2     .71   .76   .74   .63   .73   .69   .67     --
 3     .74   .80   .74   .65   .74   .81   .73    .59
 4      --    --    --    --    --    --    --     --
 5      --    --    --    --    --    --    --     --
 6     .78   .76   .75   .71   .75   .75   .74    .43
 7      --   .80   .79   .64   .79   .81   .76    .51
 8     .62   .71   .66   .52   .72   .73   .53    .44
 9     .82   .85   .81   .70   .81   .82   .77    .44
10     .77   .83   .76   .62   .78   .81   .71    .47
11      --    --    --    --    --    --    --     --
12      --    --    --    --    --    --    --     --
13      --    --    --    --    --    --    --     --
14    1.00   .85   .83   .78   .81   .86   .83    .33
15          1.00   .81   .71   .79   .87   .78    .44
16                1.00   .76   .86   .83   .79    .43
17                      1.00   .84   .76   .75    .21
18                            1.00   .88   .77    .41
19                                  1.00   .80    .40
20                                        1.00    .32
FSG                                              1.00
Mean   3.3   3.5   3.4   3.5   3.3    --   3.7   322.6
SD     .42   .35   .42   .46   .43   .46   .35    22.4

Means and standard deviations for Statements 1 to 13 are not legible in the source copy, apart
from one partial row of seven values (.56 .47 .43 .40 .42 .41 .41) whose column positions are unclear.




Table 5. Intercorrelations, 10 Rating Statements, and Final School Grade

Note: -- marks an entry that is illegible in the source copy.

                              Rating Statements
        1     2     3     4     5     6     7     8     9    10     FSG
 1    1.00   .71   .85   .64   .65   .49   .81   .72   .82   .73    .71
 2          1.00   .83   .81   .79   .73   .80   .82   .84    --     --
 3                1.00    --   .80   .69   .92   .80   .87   .87    .64
 4                      1.00   .73   .74   .82   .70   .75   .88    .41
 5                            1.00   .74   .84   .68   .78   .78    .44
 6                                  1.00   .71   .58   .63   .78    .39
 7                                        1.00   .81   .86   .87    .61
 8                                              1.00   .87   .81    .52
 9                                                    1.00   .82    .54
10                                                          1.00    .50
FSG                                                                1.00
Mean   3.2   3.1   3.3   3.5   3.5   3.6   3.4   3.3   3.3   3.5  328.3
SD     .57   .54   .47   .43   .44   .44   .44   .56   .45   .46   22.8



Table 6. Intercorrelations, Five Rating Statements,
and Final School Grades

(N = 132)

            Rating Statements
        1     2     3     4     5     FSG
 1    1.00   .71   .86   .64   .66    .68
 2          1.00   .82   .80   .79    .36
 3                1.00   .80   .80    .61
 4                      1.00   .74    .38
 5                            1.00    .38
FSG                                  1.00
Mean   3.2   3.1   3.3   3.4   3.4  329.1
SD     .55   .52   .44   .43   .43   21.8



The most obvious explanation is that Statement 1 (Learning Ability: acquires knowledge
accurately and quickly) describes the factor among the set of 20 which is most important for
success in this school environment. However, the learning at the NCO academy does not appear to
be of the kind which taxes learning ability, such as difficult academic subjects might. Looking in
from the outside, it appears that several others should be at least as important for success in this
particular school (e.g., leadership, quality of work, motivation, knowledge of duties). Furthermore,
if the raters really are making a distinction between learning ability and the other statements, it is
difficult to explain the high intercorrelations among the statements.

An alternative explanation is that the order of presentation of the statements explains the
higher validity of learning ability for the success criterion. This argument implies that whatever
statement was presented first would exhibit the highest validity, and learning ability just happened
to be the first in the series. Assuming that the raters really cannot consider the separate
[illegible] faced with the task of rating the second and
succeeding statements, the rater perceives an implication that these other statements are
somehow different from the first. So an implicit requirement is suggested by the mechanics of the
situation to rate the second and succeeding statements different from the first, but there is no
presumption of ability to do so accurately. If this scenario is accurate, intercorrelation
matrices similar to those displayed in Tables 4, 5, and 6 should result.
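
The scenario described above can be illustrated with a small simulation. The sketch below is not part of the study. It assumes, purely for illustration, that every rating is one general impression of the ratee plus independent noise; the function names and the parameter values (mean, standard deviations, seed) are hypothetical. Under that assumption every pair of statements correlates highly, which is the general pattern of the intercorrelation tables.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_ratings(n_ratees=132, n_statements=5, noise_sd=0.25, seed=0):
    """Hypothetical model: rating = general impression + independent noise."""
    rng = random.Random(seed)
    # One general impression per ratee (illustrative mean and SD).
    general = [rng.gauss(3.3, 0.45) for _ in range(n_ratees)]
    # Each statement is the same impression plus its own random error.
    return [[g + rng.gauss(0.0, noise_sd) for g in general]
            for _ in range(n_statements)]


ratings = simulate_ratings()
matrix = [[pearson(a, b) for b in ratings] for a in ratings]

# Expected off-diagonal value under this model:
# 0.45**2 / (0.45**2 + 0.25**2), roughly 0.76.
for row in matrix:
    print("  ".join(f"{r:.2f}" for r in row))
```

A matrix of this kind cannot by itself separate the two explanations, since a genuinely important first statement and a purely order-driven one could both sit on top of the same general impression.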

There is no way within the limits of this study to determine which of these two
explanations is correct, but another study in the same context varying the order of presentation
of the statements might clarify the relationships and should be easy enough to accomplish.

The results of four regression analyses appear in Table 7. The first question to be addressed
by this table is, "When five rating statements are available on a set of ratees, is anything of value
learned by considering an additional 15 rating statements when one is predicting some meaningful
criterion such as school success?"



Table 7. Regression Analyses of Varying Numbers
of Rating Statements

           Rating
Problem    Statements    R²      N     Difference    F

A          1-20          .693    45
           1-5           .605    45    .088           .458a

B          1-10          .575    89
           1-5           .549    89    .026           .872a

C          1, 19         .614    45
           1 alone       .566    45    .048          5.212b

D          1, 19, 9      .623    45
           1, 19         .614    45    .009           .991a

a Not significant.
b Significant at the .05 level.



Only 45 of the subjects (Treatment Condition 3) were available to investigate this question.
Only these subjects rated their peers on all 20 rating statements. Problem A in Table 7 shows
that there is no significant difference between the full model (using all 20 statements)
and the restricted model (using only five statements).

The next question that arises is, "Are 10 statements better than five in predicting school
success?" Data pertinent to this question were available from 89 subjects, and are shown in
Problem B. Again there is clearly no significant advantage in using 10 statements, rather than five.

Finally, it seems important to ask, "What is the smallest subset of the 20 rating statements
which carries the predictive burden of the entire set?" Problems C and D in Table 7 address this
question. Only the 45 subjects in Treatment Condition 3 were used, since these were the only subjects
who rated the entire set of 20 statements.






When rating statement 1 (Learning Ability) is the only variable in the prediction system, the
R² is .566. Adding rating statement 19 (Knowledge of Duties) to the prediction system increases
the R² to .614, an increase which is significant at the .05 level. When both rating statements 1
and 19 are in the prediction system, the addition of any of the other 18 statements does not
improve prediction significantly.
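
The comparisons in Table 7 have the form of a test of a full regression model against a restricted one. The report does not print the formula it used, so the sketch below assumes the standard F test for an increment in R² between nested models, with error degrees of freedom N - k - 1, where k is the number of predictors in the full model. The function name `incremental_f` is hypothetical, and the full model of Problem D is assumed to contain three statements.

```python
def incremental_f(r2_full, r2_restricted, k_full, k_restricted, n):
    """F ratio for the gain in R-squared of a full model over a nested one.

    Assumes df1 = k_full - k_restricted and df2 = n - k_full - 1.
    """
    df1 = k_full - k_restricted
    df2 = n - k_full - 1
    f = ((r2_full - r2_restricted) / df1) / ((1.0 - r2_full) / df2)
    return f, df1, df2


# (R2 full, R2 restricted, predictors full, predictors restricted, N),
# taken from Table 7.
problems = {
    "A": (0.693, 0.605, 20, 5, 45),
    "B": (0.575, 0.549, 10, 5, 89),
    "C": (0.614, 0.566, 2, 1, 45),
    "D": (0.623, 0.614, 3, 2, 45),
}

results = {name: incremental_f(*args) for name, args in problems.items()}
for name, (f, df1, df2) in results.items():
    print(f"Problem {name}: F({df1}, {df2}) = {f:.3f}")
```

Under these assumptions Problems A and C come out at about .46 and 5.22, close to the tabled .458 and 5.212. Problems B and D come out near .95 and .98 rather than the tabled .872 and .991, a gap that could reflect rounding of the printed R² values or a different degrees-of-freedom convention. With 1 and 42 degrees of freedom the .05 critical value of F is about 4.07, so only Problem C exceeds it, in agreement with the text.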

It is not unusual that the results of a study support a simply stated hypothesis only
partially. This study began with the hypothesis that untrained raters can rate only on some
general idea of excellence and that ratings cannot be made better by requiring the rater to rate
several separate characteristics. This hypothesis was tested, using two different external criteria of
goodness-of-rating scores.

When an external criterion of recognition of ratee profiles is used to evaluate the goodness
of sets of ratings, the results indicate that sets of rating statements larger than five do not
provide better recognition of peers. The analysis of variance design did not permit any conclusions
concerning whether five rating statements produced significantly better recognition than one
statement.

When the external criterion is class standing and the 20 rating statements are subjected to
multiple linear regression analysis, the results indicate unequivocally that large sets of rating
statements do not provide better measurement than small sets. Apparently, however, a single rating
statement does not carry as much predictive power as two. One does not know, of course,
whether a single rating statement deliberately designed to be as broad and global as possible might
have provided all the prediction attainable from all combinations of "factor" statements, since the
20 statements studied were selected partially on the basis of their apparent independence of each
other. But that does not change the fact that the beginning hypothesis had to be rejected in
favor of an alternate one that states raters can effectively use only a very small subset of the 20
rating statements investigated in this study. Future studies will determine whether a single global
rating statement will provide all the useful information available from any number of "factor"
rating statements and will find out in several different contexts what is the most likely maximum
number of useful "factor" rating statements.






REFERENCES

Curton, E.D., Ratliff, F.R., & Mullins, C.J. Content analysis of rating criteria. Chapter XIII in
    Criterion development for job performance evaluation: Proceedings from symposium.
    AFHRL-TR-[illegible], AD-[illegible]. Brooks AFB, TX: Personnel Research Division, Air Force
    Human Resources Laboratory, February 1979, 116-122.

Horn, J.L. Organization of abilities and the development of intelligence. Psychological Review, 1968,
    75(3), 242-259.

Humphreys, L.G. The organization of human abilities. American Psychologist, 1962, 17, 475-483.

Jensen, A.R. Individual differences in concept learning. Chapter 7 in Analyses of concept learning.
    New York: Academic Press, 1966.

Kavanagh, M.J. The content issue in performance appraisal: A review. Personnel Psychology, 1971,
    24, 653-668.

Massey, R.H., Mullins, C.J., & Earles, J.A. Performance appraisal ratings: The content issue.
    AFHRL-TR-[illegible], AD-[illegible] 690. Brooks AFB, TX: Personnel Research Division, Air Force
    Human Resources Laboratory, December 1978.

McNemar, Q. Lost: Our intelligence? Why? American Psychologist, 1964, 19, 871-882.

Remmers, H.H. Reliability and halo effects of high school and college students' judgment of their
    teachers. Journal of Applied Psychology, 1934, 18, 619-630.






APPENDIX A: EVALUATION FORMS

EVALUATION FORM

                                                                                    Well
                                                            Below           Above   Above
                                                            Average Average Average Average Outstanding

 1. Learning Ability - acquires knowledge accurately
    and quickly                                             (A)     (B)     (C)     (D)     (E)

 2. Leadership - effectiveness in getting ideas accepted
    and in guiding others to accomplish a task              (A)     (B)     (C)     (D)     (E)

 3. Quality of Work - produces work of high quality         (A)     (B)     (C)     (D)     (E)

 4. Motivation - strong desires to accomplish goals
    and objectives                                          (A)     (B)     (C)     (D)     (E)

 5. Follows Instructions - follows directions as
    prescribed                                              (A)     (B)     (C)     (D)     (E)

 6. Bearing and Behavior - maintains professional
    conduct and appearance                                  (A)     (B)     (C)     (D)     (E)

 7. Accuracy - precision and carefulness in work
    performance                                             (A)     (B)     (C)     (D)     (E)

 8. Oral Communication - expresses ideas clearly,
    logically, and grammatically in conversation            (A)     (B)     (C)     (D)     (E)

 9. Problem Analysis - identifies and analyzes
    problems which require action                           (A)     (B)     (C)     (D)     (E)

10. Initiative - self-starting, rarely needs a push to get
    going                                                   (A)     (B)     (C)     (D)     (E)

11. Quantity of Work - accomplishes a large amount
    of work                                                 (A)     (B)     (C)     (D)     (E)

12. Written Communication - expresses ideas clearly
    in writing with good grammatical form                   (A)     (B)     (C)     (D)     (E)

13. Punctuality - prompt in keeping engagements             (A)     (B)     (C)     (D)     (E)

14. Adaptability - changes attitude and behavior to
    meet the demands of the situation                       (A)     (B)     (C)     (D)     (E)

15. Dependability - does assigned tasks
    conscientiously without close supervision               (A)     (B)     (C)     (D)     (E)

16. Emotional Stability - stability and calmness
    under pressure and opposition                           (A)     (B)     (C)     (D)     (E)

17. Human Relations - gets along well with fellow
    workers and works effectively with them                 (A)     (B)     (C)     (D)     (E)

18. Judgment - makes good decisions among
    competing alternatives                                  (A)     (B)     (C)     (D)     (E)

19. Knowledge of Duties - understands the
    requirements for effective work performance             (A)     (B)     (C)     (D)     (E)

20. Honesty - straight-forward and truthful in dealing
    with others                                             (A)     (B)     (C)     (D)     (E)



Profile

Scale: Below Average, Average, Above Average, Well Above Average, Outstanding

1. Learning Ability - acquires knowledge accurately and quickly

2. Leadership - effectiveness in getting ideas accepted and in guiding others to accomplish a task

3. Quality of Work - produces work of high quality

4. Motivation - strong desires to accomplish goals and objectives

5. Follows Instructions - follows directions as prescribed



Evaluation Form V

Profile



Seminar

Evaluation Form VI

Profile

[Profile chart: statements 1 through 20, each printed with its definition beside a five-point scale line. Legible entries include 1. Learning Ability, 2. Leadership, 3. Quality of Work, 4. Motivation, 5. Follows Instructions, 6. Bearing and Behavior, 11. Quantity of Work, 12. Written Communication, 13. Punctuality, and 18. Judgment.]






